multi+refactor: persistent peer manager #5700
Conversation
Visit https://dashboard.github.orijtech.com?back=0&pr=5700&remote=true&repo=ellemouton%2Flnd to see benchmark details.
Force-pushed from 6998d8a to 601ffa2
Force-pushed from 601ffa2 to fccacb9
cACK, I like the direction where we start moving code out of the server.go, def involves more work tho. My question is, since we are already here, can we move all connection management into peer?
Mind elaborating? Do you mean in general or specifically in this PR?
Very happy to do so! Do you think it should all be in 1 PR?
Meant in this PR. Since we are moving the code outside of server.go.
It depends. I'd think about how urgent issue #5377 is. If it's a pressing issue, I'd suggest we move to #5538 and land it first, since it's almost done and has been reviewed. If not, we might as well take the chance to refactor the connection manager here, moving it into its own package. This will def take more time as the scale is large, but we should do it imo. Another option is to continue what you have here, so that we move out the persistent manager as the first step. Guess it's a judgment call from @Roasbeef.
Force-pushed from 155c941 to 9f1d37c
Force-pushed from 363859c to 526f660
Force-pushed from 24a11d5 to a618cc8
I'd argue it is pressing. I still have a handful of Tor peers that refuse to connect back to me even after almost 4 months. Since many nodes have IP address changes once in a while, I assume they have the same issue without ever noticing it.
Holding off on this PR until 0.15. Working on the solution in #5538 again instead 👍
I'd also argue it's quite pressing, I have several users of my wallet who have encountered this issue.
Force-pushed from 4c4b743 to f1f7bd8
This commit moves the logic in the main server's cancelConnReqs function over to the PersistentPeerManager and adds a unit test for it.
In this commit, we take note of the fact that PersistentPeerManager.DelPeer(pubKey) is always preceded by PersistentPeerManager.CancelConnReqs(pubKey). And so DelPeer is adapted to always cancel the peer's connReqs.
Since the PersistentPeerManager will always look up persisted advertised addresses for any peer we add to it, there is no need to do this outside the manager.
In order to prevent a situation where an old address map is used for connection request creation, always cancel any previous retry canceller before initialising a new one.
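For concreteness, a minimal sketch of the cancel-before-reinit pattern that last commit describes (the package, type and field names are assumptions for illustration, not the PR's exact code):

```go
package peermgr

// persistentPeer is a trimmed-down stand-in for the manager's per-peer state;
// the field name here is an assumption for illustration only.
type persistentPeer struct {
	retryCanceller chan struct{}
}

// resetRetryCanceller cancels any retry attempts that are still maturing and
// hands back a fresh cancel channel, so an old retry can never fire with a
// stale address map.
func (p *persistentPeer) resetRetryCanceller() chan struct{} {
	if p.retryCanceller != nil {
		close(p.retryCanceller)
	}

	p.retryCanceller = make(chan struct{})
	return p.retryCanceller
}
```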
Will update to 0.16 once the file for it is in the code base.
Force-pushed from ac5eebf to 077d9ca
Ok, this is finally ready for re-review :) I have restructured the commits significantly, which will hopefully make review easier.
checkpointing review here
"sync" | ||
|
||
"github.com/btcsuite/btcd/btcec/v2" | ||
"github.com/lightningnetwork/lnd/routing/route" |
I don't think route.Vertex needs to be introduced
@@ -359,7 +356,9 @@ func (s *server) updatePersistentPeerAddrs() error {
	// We only care about updates from
	// our persistentPeers.
	s.mu.RLock()
	_, ok := s.persistentPeers[pubKeyStr]
I think it's possible to remove the server mutex (un)locking calls, since those seem to be handled by the new manager. Though maybe it's best to not change any assumptions here and keep it how you did it
relaxedBackoff := computeNextBackoff(currentBackoff, maxBackoff)
relaxedBackoff -= connDuration

if relaxedBackoff > maxBackoff {
should be minBackoff
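A hedged sketch of what the fixed clamp could look like. The helper below is a simplified stand-in for the existing computeNextBackoff, and the wrapping function name is illustrative:

```go
package peermgr

import "time"

// computeNextBackoff is a simplified stand-in for the existing helper: double
// the current backoff and cap it at maxBackoff (the real helper also adds
// randomised jitter).
func computeNextBackoff(currentBackoff, maxBackoff time.Duration) time.Duration {
	next := currentBackoff * 2
	if next > maxBackoff {
		return maxBackoff
	}
	return next
}

// nextPeerBackoff shows the comparison against the *minimum* backoff: after
// subtracting the previous connection's duration, the relaxed value is only
// used if it is still above the floor.
func nextPeerBackoff(currentBackoff, connDuration, minBackoff,
	maxBackoff time.Duration) time.Duration {

	relaxedBackoff := computeNextBackoff(currentBackoff, maxBackoff)
	relaxedBackoff -= connDuration

	if relaxedBackoff > minBackoff {
		return relaxedBackoff
	}

	// The relaxed backoff dropped below the floor, so clamp it.
	return minBackoff
}
```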
// retryCanceller is used to cancel any retry attempts with backoffs
// that are still maturing.
retryCanceller *chan struct{}
I think you could use a regular channel and a boolean that, when set, means that the channel was closed. They both achieve the same thing, but I think the bool approach is nicer
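A rough illustration of that suggestion (the names here are made up, not the PR's):

```go
package peermgr

import "sync"

// retryState sketches the suggested shape: a plain channel plus a flag that
// records whether it has already been closed.
type retryState struct {
	mu        sync.Mutex
	cancel    chan struct{}
	cancelled bool
}

// cancelRetries closes the channel exactly once; the boolean makes repeated
// cancellations safe without needing a pointer-to-channel.
func (r *retryState) cancelRetries() {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.cancelled {
		return
	}

	close(r.cancel)
	r.cancelled = true
}

// resetRetries installs a fresh channel for the next round of retries.
func (r *retryState) resetRetries() chan struct{} {
	r.mu.Lock()
	defer r.mu.Unlock()

	r.cancel = make(chan struct{})
	r.cancelled = false
	return r.cancel
}
```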
Did another pass of this PR focusing on concurrency issues first. I have to say that I find it hard to be certain that there are none with the extensive use of the lock, goroutines and cancel channel.
After my previous review, I mentioned:
I wonder if it would be easier to understand with a single event loop per peer that receives updates via a channel.
I still think that would help a lot. Not just for review, but also for future devs working on this code.
Generally speaking, the connection handling for a peer is independent of all other peers. If that would be reflected in the code as a layer, it's already a bit easier to see how things work.
Then within the code for handling a single peer, I think that a single event loop that deals with retries, new addresses, backoff and additional connection requests will make the logic much more transparent.
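Roughly what such a per-peer event loop could look like; the event set and names below are guesses for illustration only:

```go
package peermgr

import "net"

// peerEvent is one update affecting a single peer's connection handling.
type peerEvent struct {
	newAddrs []net.Addr
	retry    bool
}

// peerLoop serialises all retries, address updates and backoff decisions for
// one peer through a single goroutine, so the per-peer state needs no mutex
// and no shared cancel channel.
func peerLoop(events <-chan peerEvent, quit <-chan struct{},
	connect func([]net.Addr)) {

	var addrs []net.Addr

	for {
		select {
		case ev := <-events:
			if len(ev.newAddrs) > 0 {
				addrs = ev.newAddrs
			}
			if ev.retry {
				connect(addrs)
			}

		case <-quit:
			return
		}
	}
}
```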
peerKey := route.NewVertex(pubKey)

// Fetch any stored addresses we may have for this peer.
advertisedAddrs, err := m.cfg.FetchNodeAdvertisedAddrs(peerKey)
Is it possible to miss a graph update in between this fetch and adding the peer to the list of connections? Maybe safer to swap the order?
i think it's safer to add it to the map and then update it with the fetched addrs
otherwise i think it's possible that:
- we call ConnectPeer in the server
- AddPeer is called and this ends up racing
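A compressed, self-contained sketch of that safer ordering (the types and names below are stand-ins, not the manager's real API): register the peer first, then fetch and append.

```go
package peermgr

import (
	"net"
	"sync"
)

// manager is a stand-in for the PersistentPeerManager in this sketch.
type manager struct {
	mu    sync.Mutex
	conns map[string][]net.Addr

	// fetchAddrs stands in for cfg.FetchNodeAdvertisedAddrs.
	fetchAddrs func(pubKey string) ([]net.Addr, error)
}

// addPeer registers the peer *before* fetching its advertised addresses, so a
// concurrent graph update (or ConnectPeer call) always finds an entry, and
// the later fetch is appended rather than overwriting whatever arrived in the
// meantime.
func (m *manager) addPeer(pubKey string) error {
	m.mu.Lock()
	if _, ok := m.conns[pubKey]; !ok {
		m.conns[pubKey] = nil
	}
	m.mu.Unlock()

	addrs, err := m.fetchAddrs(pubKey)
	if err != nil {
		return err
	}

	m.mu.Lock()
	m.conns[pubKey] = append(m.conns[pubKey], addrs...)
	m.mu.Unlock()

	return nil
}
```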
backoff := m.cfg.MinBackoff
if peer, ok := m.conns[peerKey]; ok {
	backoff = peer.backoff
If the peer is already being tracked, does the fetch call above still add something? Perhaps setPeerAddrsUnsafe can just append new addrs?
// connection requests for. So create new connection requests for those.
// If there is more than one address in the address map, stagger the
// creation of the connection requests for those.
go func() {
Does this need to be added to the waitgroup for when Stop is called? The actual Connect call is async too.
This is pre-existing, I guess because of this comment which I don't entirely understand:
Lines 4039 to 4041 in 22fec76
// We choose not to wait group this go routine since the Connect
// call can stall for arbitrarily long if we shutdown while an
// outbound connection attempt is being made.
// We choose not to wait group this go routine since the Connect call
// can stall for arbitrarily long if we shutdown while an outbound
// connection attempt is being made.
Wondering why this is. Is there a timeout missing?
// Next, check to see if we have any outstanding persistent connection
// requests to this peer. If so, then we'll remove all of these
// connection requests, and also delete the entry from the map.
if len(peer.connReqs) == 0 {
I think at this point, there may still be a ConnectPeer goroutine blocked on obtaining the lock. After this function returns and unblocks, ConnectPeer may still add something to the connReqs?
Yes this seems possible, if the goroutine spawned by ConnectPeer doesn't get canceled by the chan and is waiting on the mutex lock
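One way to close that window, sketched with illustrative names: re-check the cancel channel after the lock is finally acquired, so a ConnectPeer goroutine that lost the race to CancelConnReqs/DelPeer becomes a no-op.

```go
package peermgr

import "sync"

// connReq is a placeholder for connmgr.ConnReq in this sketch.
type connReq struct{}

// trackedPeer holds the per-peer connection requests guarded by a mutex.
type trackedPeer struct {
	mu       sync.Mutex
	connReqs []*connReq
}

// addConnReq records a new connection request, but only if the retry round
// has not been cancelled while the caller was blocked waiting for the lock.
func (p *trackedPeer) addConnReq(req *connReq,
	cancelChan <-chan struct{}) bool {

	p.mu.Lock()
	defer p.mu.Unlock()

	select {
	case <-cancelChan:
		// Cancellation won the race; drop the request so it cannot
		// leak past CancelConnReqs/DelPeer.
		return false
	default:
	}

	p.connReqs = append(p.connReqs, req)
	return true
}
```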
FetchNodeAdvertisedAddrs func(route.Vertex) ([]net.Addr, error)
}

// PersistentPeerManager manages persistent peers.
Maybe add a basic description of what is/makes a peer persistent?
}

peer.connReqs = updatedConnReqs
cancelChan := peer.getRetryCanceller()
I think there may be a race condition with running goroutines created by the previous ConnectPeer that are blocked on the mutex?
with peer.connReqs?
// exist.
cancelChan := peer.getRetryCanceller()

// We choose not to wait group this go routine since the Connect call
Isn't the ConnectPeer call doing its connects in a goroutine?
break
}

// If we are, the peer's address won't be known
i think this changes the behavior since now the peermgr will attempt to connect to the incoming address - before this change, the attempt wasn't made.
also, the original logic is slightly off. If we get an incoming connection and don't have any other addresses for the peer, we'll attempt to connect to the addr+port, but they may not actually be listening on the port given how tcp works. The issue seems to be that when we create a link node, we'll use the remote address as the address to store and expect to be able to connect to it
ticker := time.NewTicker(m.cfg.MultiAddrConnectionStagger)
defer ticker.Stop()

for _, addr := range addrMap {
doesn't seem like a big deal, but it's possible that the previous goroutine adds to connReqs even though the cancelChan is closed
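For illustration, the staggered loop with an explicit cancellation check before each attempt. The callback and names are assumptions, and this alone doesn't remove the window entirely, since the connect itself can still register a request after cancellation:

```go
package peermgr

import (
	"net"
	"time"
)

// connectStaggered attempts each address in turn, waiting `stagger` between
// attempts, and bails out as soon as the cancel channel is closed, so that no
// further attempts are started after cancellation.
func connectStaggered(addrs []net.Addr, stagger time.Duration,
	cancelChan <-chan struct{}, connect func(net.Addr)) {

	ticker := time.NewTicker(stagger)
	defer ticker.Stop()

	for _, addr := range addrs {
		// Re-check cancellation right before issuing the request, not
		// only while waiting on the ticker.
		select {
		case <-cancelChan:
			return
		default:
		}

		connect(addr)

		select {
		case <-ticker.C:
		case <-cancelChan:
			return
		}
	}
}
```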
@bhandras: review reminder
!lightninglabs-deploy mute
muting for a bit. Will pick this up again when my plate clears up a bit
closing for now. Can re-open once re-prio'd
This PR adds a PersistentPeerManager to the server and refactors all persistent peer logic to be handled by the PersistentPeerManager. The end result is that persistentPeers, persistentPeersBackoff, persistentConnReqs and persistentRetryCancels are all removed from the server struct and replaced by a PersistentPeerManager.